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Abstract 

An important aspect of Bayesian model selection is how to deal 
with huge model spaces, since exhaustive enumeration of all the mod- 
els entertained is unfeasible and inferences have to be based on the 
very small proportion of models visited. This is the case for the vari- 
able selection problem, with a moderate to large number of possi- 
ble explanatory variables being considered in this paper. We review 
some of the strategies proposed in the literature and argue that infer- 
ences based on empirical frequencies via Markov Chain Monte Carlo 
sampling of the posterior distribution outperforms recently proposed 
searching methods. We give a plausible yet very simple explanation 
of this effect, showing that estimators based on frequencies are unbi- 
ased. The results obtained in two illustrative examples provide strong 
evidence in favor of our arguments. 

Keywords: Bayesian model selection, Searching strategies, g-priors 



1 Inferences in large model spaces 

This paper is rooted in the model selection problem, that is with uncertainty 
surrounding the probabilistic model which, from an initial set Ai of candi- 
dates, better explains certain data y. In particular, we address the variable 
selection problem where the competing models differ about which subset of 
variables are to be included as explanatory covariates for y. 
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One special characteristic of the variable selection problem is that Ai, the 
model space, easily becomes extremely large. For instance, a problem with 
p = 40 potential covariates has 2^^ ^ 10^^ different models. The mere binary 
representation of such a model space would occupy 5 terabytes of memory. 

We focus on the difficulties that arise as a consequence of the very large 
size of Ai. We consider the problem from a Bayesian point of view and the 
context we use for the development of our ideas is the problem of variable 
selection in Gaussian regression models. 

The Bayesian approach to the problem is conceptually straightforward. 
Any feature of interest, say r, is a deterministic function of the posterior 
distribution over the model space. Examples of such features are the high- 
est posterior probability model (hereafter HPM), the inclusion probabilities 
of covariates or posterior predictions of a new value of the dependent vari- 
able. Unfortunately, three major difficulties arise when putting the Bayesian 
approach into practice: i) the choice of the prior distributions; ii) the com- 
putation of the integrated likelihood (or equivalently the Bayes factors) for 
single models in M and, in large model spaces, iii) the estimation of r, since 
its exact value is virtually unknown due to the size of A4. Benchmark pa- 
pers f o r each of these areas of res earch are respectivelv. iBerger and Pericchi 
fl200lh . lChib and Jehazko^ fl200lh and lCeorge and McCullochl (Il997h . 

Our work is basically concerned with iii), which is intimately related with 
strategies for exploring the model space (i.e. visiting a small proportion of 
models, hereafter denoted Ai*), since covering the whole model space is un- 
feasible. Our main aim is to shed new light on a topic (almost an implicit 
debate) that from time to time appears in the literature (see references be- 
low). The subject is about the estimation of r and more concisely whether it 
should be based on the empirical distribution (observed frequencies in Ji4*) 
or on the normalized Bayes factors of models in A^*. In the first approach, 
until quite recently the general one, models in A^* are visited according to 
a sampling scheme with the posterior distribution of the models in Ai as 
the stationary distribution. Markov Chain Monte Carlo methods are com- 
monly used for this task. In the second approach the emphasis is placed on 
visiting, usually without replacement, good models (i.e. with high posterior 
probability). In the rest of the paper, for ease of comprehension, we respec- 
tively refer to empirical and re-normalized for each of the approaches outlined 
above. The common use of empirical methods is MCMC methods for sam- 
pling from the posterior distribution plus an estimation of r via frequencies. 
On the other hand, the re-normalized approach uses algorithms for sampling 
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good models and estimations which are obtained via the re-normalized ana- 
lytical expre ssion of Bayes factors. Empiri c al methods have been p r opose d 
and used by ICe orge an d McCuUochl fll993h [George and McCuUochl f|l997h . 



Kuo and Mallick a998ll. iDellaportas et all fcOOOf). iNott and KohnI ( 120051 ). 



NtzoufrasI fl2002h . iNtzoufrasI fl2009[ ) and ICasella and Morenol fl2006F (just to 



mention a f e y). Papers more in favor of the re-normalized appro a ch are 



Clvde et all fcoiol). Eger and MolimI fl2005[ ). ICarvalho and ScottI fl2009h 



and lScott and Carvalhol ( l2009h . 

Re-normalized methods are motivated by the sound argument that the 
frequency of visits, in such huge model spaces, is a poor basis for estimation 
sin ce the number of re peat ed visits (if any) is very sm all. Several authors (see 
eg. IClyde et al.ll2010l and lScott and Carvalhol 120091 ) have argued in favor of 
the superiority of these procedures over the empirical ones. Nevertheless, as 
we further explain, our experience is quite the opposite, finding that in gen- 
eral empirical estimations outperform their re-normalized counterparts in key 
aspects. Our explanation for this effect is simple: empirical estimators are 
part icular cases of probability proportional to size sampling (PPS) estimators 



(see lLohrlll999[ ). and hence unbiased. Furthermore, these estimators have an 
associated measure of precision which can be very useful for the problem at 
hand. An appealing and well known extra property of empirical methods is 
that, although exploring high probability models is not their ultimate goal, 
these appear more frequently simply because they are more probable. 

Throughout this paper we develop and formalize the ideas outlined above. 
With this aim in mind, the paper has been organized in the following way. 

In Section [2] we formulate the problem of the variable selection considered. 
In order to keep the imp act of d i fficult ies i) and ii) above under control, 
we use Zellner's gf-priors (jZellnerl . Il986l ) which produce Bayes factors in a 
closed-form. The corresponding formulae are also briefly given in Section [2J 
In Section [3] the usual methods for obtaining A^* are reviewed. In Section HJ 
empirical estimators are expressed as PPS estimators and several properties 
are shown and discussed. In Section [5] we present the exact results of a 
moderately large problem (Ozone35) with p = 35 covariates obtained with 
parallel computing, programmed for the occasion with optimized C routines. 
We then compare these exact results with those obtained with empirical and 
re-normalized estimators, showing strong evidence in favor of the flrst ones. 
In Section [6] we analyze a much larger dataset (Ozone65) with 65 covariates 
(for which we do not have the exact answer). Finally, Section [7] contains a 
summary of the main conclusions in this paper. 
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2 Bayesian variable selection 

Let X = {xij} he an N X p full rank matrix and 7 = (71, . . . ,7^) be a 
p- dimensional vector of binary variables. Denote = each 7, 

let be the N x k.y design matrix corresponding to the columns with ones 
in 7. 

The variable selection problem we consider has 2^ competing models, 
each proposed as a plausible explanation of an N dimensional vector Y. 
More concisely 

M^:Y N^ial + X^f3^, a^I), 7 e {0, If. (1) 

In this problem, the model space A4 can be represented by {0, 1}^. The 
simplest model among the proposed ones is 

Mo-.Y NN{al,a'^I). 

Without loss of generality, posterior probabilities of the models can be 
expressed as 

PriM^\y) = CB^oPriM^), (2) 
where B^q is the Bayes factor of to Mq, Pr{M^) is the prior probability 



of My and 



7 

is the constant of proportionality. 

Bayes factors are the ratio of the marginal prior predictive distributions 
evaluated at y, that is, B^q = m^{y)/mo{y), where 



[y) = J Njyiy I al + X^(3^,a'^I) TT^{a, f3^,a) dad(3^da. 



The function vr^ is the prior distribution for the parameters under model M^. 
It is well known that this prior can be neither improper nor vague (with a 
ver y large variance) since the resulting Bayes factors are essentially arbitrary 



see 



Berger and Pericchill2001l) . Our default choice for this prior is the ^f-prior 



proposed by IZellnerl (119861 ): 

7r^(a,/3^,(7 I g) = a-' Nk^Xf^^ \ 0,ga^ {Xt^{I - N-'ll')X,r'] 
for 7 7^ and ttqIo, a) = for Mq. 
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The (yf-priors seem to be greatly inspired by Jeffreys' ide as (IJeffrevsi. Il961l) 
and th e cor responding extens i on to regression problems in IZellner and Slow 
fll980h and IZellner and Siowl fll984 ). The assignment of the constant g has 
been analyzed by several authors (see lLiang et al.ll2008l and references therein) 
This parameter g must increase with N to avoid an asymptotically degener- 
ate prior. The default assignment g = N gives rise to a 'unit information" 
prio r in the sense that the covariance matrix is corrected by the sample size 



see 



Raftervlll998f l. 



An alternative to the choice of the constant g is to assume, hierarchi- 
cally, a proper prior on g, say ^ig \ "f)- In general, the resulting prior 
for f3^ has heavy tails, this being an appealing characteristic of a model 
selection prior which is related t o properties like information consistency 
( iBayarri and Garcia-Donatd l2008l ) . Examples of such prior s are the mul- 
tivariate Cauchy for ^ proposed by Jeffreys-Z ellner-Siow ( Jeffreys 196ll . 
Zellner and Siowlll980l and lZellner and Siowlll984j ) w hich corresponds to 7r{g \ 



7) = Gamma ^{1/2, N/2), t he hyper- q priors of iLiang et al.l ( 120081 ). the 



m 



Maruyama and Georg i torn 



C onventional Robust prio r in iFortd ( 120081 ) and the extension of the g prior 



The (yf-prior provides closed-form expressions for the Bayes factors, 
fact it can easily be shown that 



In 



B^o{g) =[i + g 



ssi) ' 



l)/2 



(3) 



where SSE-y is the sum of the squared errors for My. Therefore, if all mod- 
els 7 can be visited, their posterior probabilities can be c alculated wit hout 
great computational effort. In terestingly, the proposals in iFortd ( 120081 ) and 
Maruyama and George (l2010f ) also lead to closed-form Bayes factors. 

Notice that independently of the approach adopted to construct the in- 
ferences (either empirical or renormalized) the above expression can easily be 
used to unequivocally identify the model that, within a given set of models, 
has the largest posterior probability (since of course it coincides with the 
model with the largest B^Q{g)). 



3 Search in large model spaces 

With the distinction introduced in Section [H empirical methods use the 
relative frequencies of models visited in a subset A^* C as the basis for the 
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estimation of r (the quantity of interest). On the other hand, re-normalized 
methods base this estimation on the use of the renormahzed expression of 
Bayes factors for models in A^*. Clearly, the way A4* is obtained can vary 
from one approach to the other. In this section we succinctly overview the 
methods used in the two approaches. 



A^* in empirical methods Markov Chain Monte Carlo (MCMC) meth- 
ods have provided decisive numerical support for the development of Bayesian 
methods over the last two decades. Bayesian model selection is not an excep- 
tion and the literature devoted to MCMC strategies for solving the problem 
is extensive. When estimations are based on the frequency of visits, the mod- 
els visited form approximately a sample from the posterior distribution. In 
this setting MCMC methods are an essential tool for generating the sample 
mentioned. 

A great majority of these proposals are to a c ertain extent based on 
the seminal work by iGeorge and McCuUochl (Il993[ ). greatly improved and 
extended in iGeorge and McCuUochl (11997 ) . A nuniber of interesting contr i- 



butions on this area are iKuo and M alliclg (119981). iDellaportas et al.l (120001) 



Nott and Kohnl ( l2005h . rNtzoufrasl ( l2002i) . iNtzoufrasl (l2009h and lCasella and Moreno 



mom . 

In Appendix |A] we describe the sampling strategy we propose for the 
model selection problem in ([1]) with hierarchical g priors. This is a straight- 
forward Gibbs sampling scheme that takes advantage of the integrated ex- 
pression in ([3]). The particular case for q-priors that w e used in the examples 
had already been suggested by IGeorge and McCuUochl ( 1l997l ). This sampling 
scheme can become ext remely efficie nt in combination with updating iden- 



tities for the SSE's (see lGentlell2007l and references therein) since it is built 



upon steps in which a variable is either added or deleted. 

With an MCMC sampling, given an initial model we obtain a sample 



of models M' 



{7 



(1) ^(2) 



,7^"', . . . ,7^"''} having Pr{My \ y) as the stationary 
distribution. This is a key characteristic of A^* that provides the ensuing em- 
pirical estimators of r with important characteristics described in Section HJ 



A^* in re -normalized methods The origins of this approach date back 
at least to IGeorge and McCuUochl (119971 ) who pointed out the possibility of 
using an MCMC sample Af* in combination with the normalized expression 
of Bayes factors as the basis for producing the required inferences. For those 
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models in M. not sampled, the posterior probability is assumed to be zero 
and for the rest Pr{M.y \ y) oc B^q, such that the sum over the models visited 
is one (i.e. probabilities are obtained by re-normalizing). Notice that this 
way, MCMC methods act as searching methods (on the grounds that good 
models should appear more frequently because they are more probable). 

When analyzing the method in the preceding paragraph two observa- 
tions arise. First, since frequencies are not used, visiting a model more 
than once is a waste of time, suggesting that it would be preferable to sam- 
ple without replacement. The second is that unvisited models in general 
have very low probability, so we should mainly focus on sampling good 
models (with high posterior probabilities). These two ideas have inspired 
the appearance of specific methods to search the model space for good 
models without repetition. E xamples of such p roposals are the Bayesian 
Adaptive Sampling method of IClyde et al.l ( 120101 ). the searching method of 



Berger and Molinal (I20051) in whi c h the Featu re Inclusion Stochas ti c Sea rch 



(FINCS) of IScott and Carvalhd (120091 ) and ICarvalho and ScottI mm is 
based. A common recursive idea in these methods is that the exploration of 
Ai is guided by estimates of inclusion probabilities of single covariates. This 
potentially leads to biased results because, it does not have to be the case 
that high inclusion probabilities point to the most probable models. We will 
see a demonstration of this effect in the examples in Section [5] and Section [61 



4 Inferences in model selection problems 

In a model selection problem, one relevant question is which of the proposed 
models is the most probable in the light of the data (the highest posterior 
probability model, HPM). In this situation, the quantity of interest is 

r = HPM = argmax5^o Pr{M^). 

Given a set of visited models A4*, the obvious and most precise way of 
estimating the HPM is common to empirical and re-normalized methods and 
is: 

f = HPM = argmax5^o Pr{M^). 

Notice that the goodness of the estimation of the HPM only depends upon 
the ability of the methods to search for good models in very large model 
spaces. 
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Nevertheless, very frequently, the quantity of interest r which we want 
to infer is of a different nature. A crucial aspect is that on many occasions 
this quantity implicitly depends on the normalizing constant. These can be 
written in terms of the expectation 

r(a) = Ep,(.|,)(a(M^)) = a(M^) Pr(M^ | y), (4) 

7 

where a(M^) is a known function of M^. Clearly, the posterior probability 
of a single model M^^ can be expressed as (jl]) with a(M^) = 1 if = M^*, 
and zero otherwise. There are many other examples of such representation 
of quantities of interest in the model selection problem. 



Example 1: Inclusion probabilities and the median probability 
model For a given explanatory variable xi the inclusion probability is defined 
as 

qi= Yl MM^\y). 

1- 7i=l 



These probabilities have interesting theoretical properties as shown in lBarbieri and Bergei 
l 2004 ) and are useful summaries of the posterior distribution. In particular, they 
can be helpful when the number of models is very large and the posterior prob- 
abilities of single models are so small that are very difficult to interpret. Apart 
from their intrinsic i nterest, inclusion pro babilities are the basis of the median 
probability model in \Barbieri and Bergei ^2004). This model, hereafter called 
M PM, is defined as the one with those variables with qi > 0.5 and the theory 



m 



Barbieri and Bergei suggests that the MPM model has optimal proper- 



ties and is better for prediction purposes than the HPM (a surprising fact). The 
probability, qi can be expressed as (dP with a{M^) = 1 if^i = 1, and otherwise. 

Inclusion probabilities are the most popular element in a set of useful sum- 
maries for the variable selection problem. For instance, we can be interested in 
the joint posterior probability of both xi and xy and this measure can also bewrit- 
ten easily in the form of lip. 



Example 2: Posterior probability of dimension of the 'true' model 

The probability that the 'true' model has exactly k* explanatory covariates is 

d{k*)= Pr{M^\y)- 

■y: k^=k* 

This corresponds to the expression in ^ with a{M^) = 1 if = k* , and 
otherwise. 
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Example 3: Model averaging techniques Suppose that A is a quantity 
of interest, then the posterior distribution Pr{A \ y) is Q with a{M^) = Pr(A | 
y,M^). 

What arises is, of course, the methodology called Model Averaging, which is 
just the Bayesia n way of accounting for the uncertainty regarding which the true 
model is (see eg. Hoeting et al. 199d ). 

Special mention should be made of the case where A is a future observable y^^^ ^ 
given certain values of the explanatory covariates x^^"^ . In this case Pr[y^^^ \ y) 
is the posterior predictive distribution. Notice also that summaries of this dis- 
tribution are special cases of (dP, like the posterior predictive expectation (with 
a(M) = E{y"'^'^ \ M^,y) ) or the posterior predictive variance. 



Now suppose Ai* = {j^^^j 



(2) 



7'^")} have been randomly simulated 



with replacement from Ai such that on each draw, each model has a 
probability Pr{M^ \ y) of being selected (we think of A^* as approximately 
the sample of models produced with MCMC methods, Section [3l). What 
arises is a probability proportional to size sampling (see iLohrl Il999l ) where 



the 'size' of each sampling unit (the models) is Pr(M^ 
estimator of r(a) in (j4]) under this sampling scheme is 



y). The usual 



r a 



n 

-E 

n ^ 



a(M^u))PriM^U) \ y) 



y) 



U), 



usually known in the literatu re as the Hansen-Hurwitz fo r random sampling 



with replacement estimator (IHansen and Hurwita Il943 ). It can be easily 



shown that r(a) is an unbiased estimator of r(a) (jLohiill999l ). 

As a consequence, the Hansen-Hurwitz estimator of the posterior proba- 
bility of a single model M* is the frequency of M* in Ai*. Likewise, Hansen- 
Hurwitz estimators of the quantities in the previous examples are 



n 



d{k*) 



r-ii 



n 



and 



Pr(A I y) 



1 

- VPr(A 



U), 



Of course, these are just the em^jznca/ estimators (as labeled in Section 1) 
based on the frequency of visits. This correspondence is a key point which 
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provides theoretical support to the arguments introduced in Section 1 and 
our experience (partially presented in the following sections) regarding the 
extremely good results of empirical methods. It now becomes obvious that 
these estimators are implicitly based on the analytical expression of the Bayes 
factor through the sampling mechanism used. Moreover, they enjoy the desir- 
able properties of Hansen-Hurwitz estimators, as for example unbiasedness. 

Furthermore, it is quite interesting that these estimators come with a 
measure of precision, a characteristic that has remained unnoticed in this 
context until now. This may have interesting applications and important 
consequences as, for instance, knowing when n gives enough precision in the 
estimation of the quantity of interest. If the draws o n A4* are independently 



obtained, the variance of f(a) is (see eg. iLohrl 119991 ) 



In the case that the quantity of interest is a probability p (e.g. probability of 
a single model or an inclusion probability), V{f{a)) = n~^p{l — p) which is, 
of course, bounded above by l/(4n) (this bound being a reasonable measure 
of the variability for probabilities not very close to zero). This provides 
an accurate idea of the precision achieved with the procedure and can be 
used (depending on the magnitude of the probability being estimated), for 
example, to decide the number of draws needed. 

Also useful is that an unbiased estimator of the variance of f(a) is 

n(n — 1 ' 
j=i 

At this point it could argued that these results are of limited importance 
in practice since in MCMC sampling schemes we are not exactly sampling 
from the posterior and draws are dependent. Strictly speaking this is true, 
although our experience (partially shown in the following sections) is that 
these properties (the unbiasedness of f(a) and the expression for Viji^a)) 
hold quite accurately in practice. On the other hand, notably, these are 
basic assumptions that underlie any analysis solved with MCMC methods, 
the literature containing plenty of techniques for improving the results of 
MCMC methods in this sense. Among them, probably the most popular 
and simple to implement yet very effective are thinning (to systematically 
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keep one simulation out of several) and burning (to reject some of the first 
simulations). 

The estimator of r(a) within the re-normalized approach is 

J2 a{M,)Pr{M,\y) 
7eM* 

where the posterior probabilities of single models are obtained by re-normalizing 
the Bayes factors, that is 

The properties of these estimators are in general difficult to derive. The bias 
of such estimators for the post erior probability for sing le models has been 
the subject of a recent study in IClyde and Ghosh 



5 Example I: a large problem with an exact 
solution 

Mainly for comparative purposes but also to report the exact results on a 
moderately large problem (something that has not been done before to the 
best of our knowledge) , here we present the exact solution for a problem with 
p = 35 covariates, and hence with 34, 359, 738, 368 ~ 3 ■ 10^° different models. 
Having the exact results of a large problem seems to us the most informative 
and clarifying way of comparing the performance of searching methods. 

It is gen erally understood that problems with p larger than 25-30 are 
intractable, (IClyde et al.l . I2OIOI ). and we are considering a step beyond this 
limiting size. The results we obtained were derived using a cloud with 150 
processors and took appr oximately 20 hours to run. The code was written 
in C, with the gsl library ( iGalassi et al.l . 120091 ). The source code is available 
upon request. 



The data we analyzed were previously used by lCasella and Moreno! (120061 ) 
and iBerger and Molinal (120051 ) and concern = 178 measures of ozone 
concentration in th e atmo sphere. Details on the data can be found in 
Casella and Moreno! (12006! ). Of the 10 main effects originally considered, 
we only ma k e use of those with an atmospheric meaning, as was done by 
Liang et al.! (12008! ). Then we have 7 main effects which, jointly with the 
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Variable Description 

y Response = Daily maximum 1-hour-average ozone reading (ppm) at Upland, CA 

Xi Month: 1 = January, . . . , 12 = December 

X2 Day of month 

X3 Day of week: 1 = Monday, . . . , 7 = Sunday 

X4 500-millibar pressure height (m) measured at Vandenberg AFB 

X5 Wind speed (mph) at Los Angeles International Airport (LAX) 

X6 Humidity (%) at LAX 

X7 Temperature (Fahrenheit degrees) measured at Sandburg, CA 

xg Inversion base height (feet) at LAX 

xg Pressure gradient (mm Hg) from LAX to Daggett, CA 

xjo Visibility (miles) measured at LAX 

Table 1: Description of variables used in Example I and Example II 



quadratic terms and second order interactions, produce the above mentioned 
p = 35 possible regressors. For comparative purposes, we keep the original 
notation of the variables defined in Table [U We call this dataset Ozone35, 
for which we now present the exact results. We use the ^f-prior with g = N 
and a constant prior for the prior probabilities of models. 

The sum of all Bayes factors (the proportionality constant) is 

5^ 5,0(^7) = 1.13 10^°. 

7 

The highest probability model, HPM, has covariates {1, Xio, x^Xq, XgXg, a;^, XjXiq] 
and has a posterior probability of 0.0009, with a Bayes factor (in its favor and 
against Mq) of 1.0210^^. The first 1000 most probable models accumulated 
a total probability of 0.07 and a sum of Bayes factors (expressed in deci- 
mal logarithm) of 48.92 (this value is used later). Inclusion probabilities of 
each variable are in Table |2l Hence, the median inclusion probability model, 
MPM, is {1, Xg, xeXy, xeXg, xtXio} which has a posterior probability which is 
twenty three times lower than the probability of the highest posterior prob- 
ability. Moreover, there are 851 models which are more probable than the 
median inclusion probability model. 

We then run the following methods ten times, each with n = 10000 
iterations. 

Freq Gibbs sampling with algorithm in Appendix |X] the with initial model 
M^^) = Mq (we did not observe differences starting with the full model 
or with a randomly chosen model). For a fair comparison among the 
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methods compared we did not exclude any model sampled and did not 
use any burning period. 



BAS Bayesian adaptive sampling of IClyde et al. fl2010l ) through the corre 



sponding R-package BAS. As recommended (personal communication), 
we used method = "MCMC + BAS" , which uses an MCMC method to 
initialize the search (this is a clear improvement over other options 
like eplogp, which uses a rough approximation of inclusion probabil- 
ities with p- values to initialize the search). We tuned the parameter 
update = 500 so that sampling probabilities were updated every 500 
iterations. 



SSBM The Stochastic Search in lBerger and Molina! (120051 ). This method was 



originally proposed for a particular prior but the searching algorithm 
can be easily adapted to accommodate the g-phoi. 

The estimates computed in Freq are proportional to size (see Section Hj) 
and hence based on frequency of visits, and in BAS and SSBM estimators are 
based on the renormalization of the Bayes factors. Hence, Freq is a particular 
method within the empirical approach, while BAS and SSBM are methods of 
the re-normalized approach (using the labels introduced in Section [1]). The 
results are summarized in Table [2l 

For the first run of each method, in Table [2] we present estimates of the 
inclusion probabilities, the MPM and the HPM. With this same run we esti- 
mated the standard deviation of the estimators of the inclusion probabilities 
with Freq using ([5]). In addition, with the ten runs we computed the ob- 
served standard deviation as this provides a measure of variability in BAS 
and SSBM (for which an expression like the one in [5] does not exist). 

The main conclusions that we have extracted from the former simulations 
can be summarized as follows: 



Regarding the MPM and inclusion probabilities Of the ten exper- 
iments conducted, Freq correctly identified the MPM ten times while with 
BAS and SSBM the estimated MPM and the real MPM did not coincide in any 
of the ten runs. Also, Freq provides very accurate estimations of the (ex- 
act) inclusion probabilities with a small variability. This confirms the high 
efficiency of such estimators. The observed variability with BAS and SSBM is 
large, so in general we expect large differences in repetitions of these methods 
in a similar manner to that observed in this experiment. 
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One great advantage of empirical methods over re-normalized is that the 
first come with a measure of precision in the estimates This measure 
can be legitimately criticized since it is just an approximation due to the 
dependency between the simulations. Nevertheless, in this experiment these 
estimates and the observed standard deviation are quite close to each other, 
suggesting that (E]) is quite a reasonable estimator. 

The most worrisome aspect observed of BAS and SSBM is that they are 
clearly biased: for certain covariates we have to move from the point estima- 
tion more than 10 times the standard deviation to cover the exact value of 
the inclusion probabilities (see eg. xqXs in BAS and x^xio in SSBM). The na- 
ture and origin of this bias has an easy interpretation after a careful reading 
of the table. In BAS, six inclusion probabilities are overestimated, of which 
five are in the estimated HPM; the rest are underestimated. A similar pat- 
tern is found in SSBM. This means that inclusion probabilities within these 
methods are very influenced by the estimated HPM, leading to a bias in the 
direction of the HPM model. This effect is, in our opinion, the manifesta- 
tion of a search in the model space for good models guided by the inclusion 
probabilities. 

Regarding the HPM and probability mass discovered One interest- 
ing question is which method is visiting better models. BAS and SSBM are, 
in some sense, specifically designed with this aim while this characteristic 
is presumed in MCMC methods (since more probable models should be vis- 
ited just because they are more probable). In our experiment, BAS correctly 
identified the HPM nine times while SSBM did it five times. The exact HPM 
was among the visited models in Freq in the ten runs, showing that Freq is 
visiting good models. 

Finally, we calculated the mean and standard deviation (over the ten 
runs) of the sum of the Bayes factors of the 1000 (in decimal logarithm) 
most probable different models explored (to be compared with the exact 
value given above of 48.92). The results were 48.77(0.01), 48.64(0.05) and 
48.50(0.20), for Freq, BAS and SSBM respectively. In this respect, the three 
methods analyzed behave similarly, although Freq gives more stable answers. 
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0.018 
(0.007) 
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Table 2: Inclusion probabilities (exact qi and estimates qi in one run of 10000 
iterations) for the Ozone35 data set. Also, V{qi) is the estimated variance ([5]) 
using this same run and S{qi) is the deviation of the estimators observed in ten 
independent identical runs. Symbols (f) for those variables in the estimated MPM 
and asterisks (*) for those variables in the estimated HPM. 



6 Example II: A much larger problem 

We now consider the full Ozone dataset with the 10 main effects, the quadratic 
terms and the second order interactions. The same problem has been consid- 
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ered by iBerger and Molina (120051 ) and, as before, we keep the same notation 
for the covariates as there. This problem has p = Q5 and hence 2^^ 3.7 10^^ 
models in A^. In what follows we call this dataset Ozone65. The size of Ai 
precludes having the exact answer to the problem. To have an approximate 
idea of the unfeasibility, notice that with the C code that we used for the 
Ozone35 it would take more than 350 years to compute the answer using a 
cloud with 10^ processors. We cannot wait that long. 

For this dataset we repeated the comparison in Section [5] and performed 
10 different runs, now each with n = 100, 000 iterations, of Freq, BAS and 
SSBM. In Table [3] we present the statistics of all the variables included in the 
estimated HPM and MPM in any of the runs of the methods being compared. 
In essence, these results are in clear agreement with our findings in Ozone35, 
and confirm the conclusions drawn there. 



Regarding the MPM and inclusion probabilities It is unknown which 
model is the MPM (and it will probably never be known) but results with 
Freq provide a very reasonable and consistent picture of the solution. In 
Freq, except for one run, there is unanimity in the estimation of the MPM. 
Furthermore, and quite appealing is that we can give an explanation of the 
disagreement in terms of errors in the estimation. The discordant run differs 
from all the others in that it includes xix^. This variable has an estimated 
inclusion probability of 0.497 with an estimated error of 0.002. 

In BAS and SSBM results vary considerably over the different runs. In 
BAS (SSBM) 8(10) different models were estimated as the MPM and hence, at 
least 8(9) times this method has incorrectly identified the true MPM. More 
worrisome is that these bad results do not seem to be due to variability. 
We can find a more likely explanation in Table [3] where we can clearly see 
that, in many occasions, the estimated MPM mimics the estimated HPM. 
For instance, Xq and x^Xq (which are in the estimated HPM) are always in 
the MPM estimated by BAS. This also seems to be the case for X4Xf and XqX7 
in SSBM. We interpret these results as a manifestation of the bias produced 
with methods conceived to look for good models. 



Regarding the HPM and probability mass discovered In this aspect, 
the three methods behave quite similarly, perhaps BAS and Freq perhaps 
performing slightly better than SSBM. The best model found in the whole 
experiment, the ten runs of the three different methods, had a Bayes factor 
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(in its favor and against the null) of 50.87 (expressed in decimal logarithm). 
This model, identified in Table [3] with asterisks, was visited in four of the 
ten runs by BAS, in three runs by Freq and in one run by SSBM. The means 
(standard deviations) over the ten runs of the sum of the Bayes factors of 
the 1000 (in decimal logarithm) most probable different models explored 
were 52.77(0.02) in Freq, 52.78(0.15) in BAS and 52.78(0.16) in SSBM. These 
results confirm the popular hypothesis that good models also show up when 
sampling from the posterior distribution. 

7 Summary and main conclusions 

In the context of Bayesian model selection with very large model spaces Ai, 
quantities of interest r in the problem have to be estimated since their ex- 
act value is, in practice, unknown. This is mainly because the underlying 
normalizing constant, whose determination would imply the computation of 
Bayes factors for all the models, is virtually unknown. In this situation, such 
estimates have to be constructed from a sample of models A^* of and can 
be derived either by using the empirical distribution or through renormaliza- 
tion of the Bayes factors. Within the first approach, M* has to be a sample 
from the posterior distribution and is usually obtained with MCMC sam- 
pling. In the second approach, Ai* does not necessarily have to be obtained 
with probabilistic-based mechanisms and the emphasis is normally placed on 
sampling good models. We labeled each of these approaches empirical and 
re-normalized. We have shown that empirical estimates are in general, under 
the common assumptions in MCMC sampling, unbiased. Also, the uncer- 
tainty regarding the estimations can be easily derived, making the empirical 
approach very appealing. 

We have compared several methods in a moderate to large problem with 
p = 35 covariates for which we have derived the exact answers, and on a larger 
problem with p = 65 with an unknown solution. With respect to sampling 
good models and in particular the highest posterior probability model, (a 
problem for which the normalizing constant is not needed), empirical and re- 
normalized methods behave quite similarly. Nevertheless, in the estimation 
of other important parameters like the inclusion probabilities, re-normalized 
methods can be strongly biased. 
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Table 3: For the Ozone65 dataset, the number of times each covariate is included 
in the estimated HPM and the estimated MPM in ten independent runs of n = 
100000 iterations of Freq, BAS, and SSBM methods. Asterisks identify the best 
model encountered in the full experiment. Also, g^'s are the estimation of the 
inclusion probabilities from the first run of Freq and Visii) is their estimated 
variance ([5]) using this same run. 
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A Gibbs sampling algorithm for hierarchical 
^-priors 

Once the parameters /3^, a have been analytically integrated out (see [3]), the 
only unknown parameters in the problem are g and the components in 7. 
Those have full conditional distributions: 

7i I 7i, . . . , 7i_i, 7i+i, . . . , 7p, 5(, 2/ ~ Bernoulli (pi), (6) 

where 



Pi = 





7 = a)Pr{Ma) 




\-f = a)Pr{Ma) + Bboig)7r{g 


7 = b)Pr{Mh) 



where 

a = (71, • • • ,7j-i, l,7i+i, • • • ,7p), 

and 

b = (7i,...,7i_i,0,7i+i,...,7p). 
The full conditional for g is 

fig I 7,y) oc B^o{g)T^{g I 7), 

which can easily be sampled via Metropolis-Hastings with an obvious pro- 
posal: (?* ~ vr(5f I 7). The case for ^f-priors and any other prior that leads to 
a closed expression is a particular case of the above with 7r((7 | 7) = 1 and 
the step for g is not used. 
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